Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests through a deterioration of movement, including tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone speech (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and the risk of dementia is increased. Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test for PD, diagnosis is often difficult, particularly in the early stages when motor symptoms are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that does not require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.
Domain: Medicine
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject: one = Parkinson's, zero = healthy
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
The goal is to classify patients into the respective labels using the attributes derived from their voice recordings.
Steps and tasks:
Load the dataset
It is always good practice to eyeball the raw data to get a feel for it in terms of number of records, structure of the file, number of attributes, types of attributes, and a general idea of likely challenges in the dataset. Mention a few comments in this regard (5 points)
import numpy as np # for array handling
import pandas as pd # for dataframe handling
import seaborn as sns # plotting
sns.set(color_codes=True)
import matplotlib.pyplot as plt # plotting
%matplotlib inline
# For preprocessing the data
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
from scipy import stats
from sklearn import metrics
# To model the Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
data=pd.read_csv('Data - Parkinsons.txt')
data.head(10)
From the table above, the target column (status) sits in the middle of the DataFrame. For convenience, we move the target column to the end of the dataframe.
status = data['status'] # save the target column
data.drop(['status'],axis=1,inplace=True) # drop status from the dataframe
data['status']=status # append the target column at the end of the dataframe
data.head()
data.shape #no of rows and columns in the dataframe
There are 195 rows and 24 columns in the DataFrame.
data.dtypes # to get the data type of each attributes
data.isnull().sum() # to check the presence of missing values
As the counts are all 0, there are no missing values in the dataframe.
data.describe().transpose()
MDVP:Fhi(Hz) contains outliers and is highly skewed.
summary=data.describe().T
summary[['min', '25%', '50%', '75%', 'max']]
data.skew(numeric_only = True)
Positive skewness indicates a longer right tail (data skewed to the right); negative skewness indicates a longer left tail (data skewed to the left).
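Although this notebook caps outliers column by column below, a log transform is a common alternative for strongly right-skewed columns; a minimal hedged sketch, applied to a copy of the data (the threshold of 1 and the helper names here are illustrative, not part of the original workflow):
# Sketch: shrink strong right skew with log1p on a copy of the data
skewed = data.skew(numeric_only=True)
right_skewed = [c for c in skewed[skewed > 1].index
                if (data[c] >= 0).all()]  # log1p needs non-negative values
transformed = data.copy()
transformed[right_skewed] = np.log1p(transformed[right_skewed])
print(transformed[right_skewed].skew())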
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Fo(Hz)'], showfliers=True).set_title("Distribution of 'MDVP:Fo(Hz)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Fo(Hz)'],color='g').set_title("MDVP:Fo(Hz) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Fo(Hz)'].plot.hist(color='r').set_title("MDVP:Fo(Hz) Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Fhi(Hz)'], showfliers=True).set_title("Distribution of 'MDVP:Fhi(Hz)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Fhi(Hz)'],color='g').set_title("MDVP:Fhi(Hz) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Fhi(Hz)'].plot.hist(color='r').set_title("MDVP:Fhi(Hz) Vs Frequency");
q3 = data['MDVP:Fhi(Hz)'].quantile(0.75)
q1 = data['MDVP:Fhi(Hz)'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Fhi(Hz)'].loc[data['MDVP:Fhi(Hz)']>outliers_above].count())
print(data['MDVP:Fhi(Hz)'].loc[data['MDVP:Fhi(Hz)']<outliers_below].count())
print(data['MDVP:Fhi(Hz)'].loc[data['MDVP:Fhi(Hz)']>outliers_above])
mean_val = data['MDVP:Fhi(Hz)'].loc[data['MDVP:Fhi(Hz)']<=outliers_above].mean()
data['MDVP:Fhi(Hz)'] = data['MDVP:Fhi(Hz)'].mask(data['MDVP:Fhi(Hz)']>outliers_above,mean_val)
print(data['MDVP:Fhi(Hz)'].head(20))
#Distribution after outlier treatment
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Fhi(Hz)'], showfliers=True).set_title("Distribution of 'MDVP:Fhi(Hz)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Fhi(Hz)'],color='g').set_title("MDVP:Fhi(Hz) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Fhi(Hz)'].plot.hist(color='r').set_title("MDVP:Fhi(Hz) Vs Frequency");
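The cells that follow repeat the same IQR-based capping for each column. A hedged consolidation sketch (the helper name is ours, not part of the original notebook); note the fence used throughout is Q3 + 1*IQR rather than the more common Q3 + 1.5*IQR, mirroring the cells above:
# Sketch: reusable version of the per-column outlier treatment used below.
# replace='mean' or replace='max' picks the imputation used in the original cells.
def treat_outliers_iqr(df, col, replace='mean'):
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    upper = q3 + (q3 - q1)  # fence at Q3 + 1*IQR, matching the cells above
    inliers = df[col] <= upper
    fill = df[col][inliers].mean() if replace == 'mean' else df[col][inliers].max()
    df[col] = df[col].mask(~inliers, fill)
    return df
# Hypothetical usage: treat_outliers_iqr(data, 'MDVP:Flo(Hz)', replace='max')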
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Flo(Hz)'], showfliers=True).set_title("Distribution of 'MDVP:Flo(Hz)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Flo(Hz)'],color='g').set_title("MDVP:Flo(Hz) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Flo(Hz)'].plot.hist(color='r').set_title("MDVP:Flo(Hz) Vs Frequency");
q3 = data['MDVP:Flo(Hz)'].quantile(0.75)
q1 = data['MDVP:Flo(Hz)'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Flo(Hz)'].loc[data['MDVP:Flo(Hz)']>outliers_above].count())
print(data['MDVP:Flo(Hz)'].loc[data['MDVP:Flo(Hz)']<outliers_below].count())
print(data['MDVP:Flo(Hz)'].loc[data['MDVP:Flo(Hz)']>outliers_above])
max_val = data['MDVP:Flo(Hz)'].loc[data['MDVP:Flo(Hz)']<=outliers_above].max()
data['MDVP:Flo(Hz)'] = data['MDVP:Flo(Hz)'].mask(data['MDVP:Flo(Hz)']>outliers_above,max_val)
#Distribution after outlier treatment
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Flo(Hz)'], showfliers=True).set_title("Distribution of 'MDVP:Flo(Hz)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Flo(Hz)'],color='g').set_title("MDVP:Flo(Hz) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Flo(Hz)'].plot.hist(color='r').set_title("MDVP:Flo(Hz) Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Jitter(%)'], showfliers=True).set_title("Distribution of 'MDVP:Jitter(%)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Jitter(%)'],color='g').set_title("MDVP:Jitter(%) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Jitter(%)'].plot.hist(color='r').set_title("MDVP:Jitter(%) Vs Frequency");
q3 = data['MDVP:Jitter(%)'].quantile(0.75)
q1 = data['MDVP:Jitter(%)'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Jitter(%)'].loc[data['MDVP:Jitter(%)']>outliers_above].count())
print(data['MDVP:Jitter(%)'].loc[data['MDVP:Jitter(%)']<outliers_below].count())
print(data['MDVP:Jitter(%)'].loc[data['MDVP:Jitter(%)']>outliers_above])
max_val = data['MDVP:Jitter(%)'].loc[data['MDVP:Jitter(%)']<=outliers_above].max()
data['MDVP:Jitter(%)'] = data['MDVP:Jitter(%)'].mask(data['MDVP:Jitter(%)']>outliers_above,max_val)
#distribution after outlier treatment
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Jitter(%)'], showfliers=True).set_title("Distribution of 'MDVP:Jitter(%)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Jitter(%)'],color='g').set_title("MDVP:Jitter(%) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Jitter(%)'].plot.hist(color='r').set_title("MDVP:Jitter(%) Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Jitter(Abs)'], showfliers=True).set_title("Distribution of 'MDVP:Jitter(Abs)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Jitter(Abs)'],color='g').set_title("MDVP:Jitter(Abs) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Jitter(Abs)'].plot.hist(color='r').set_title("MDVP:Jitter(Abs) Vs Frequency");
q3 = data['MDVP:Jitter(Abs)'].quantile(0.75)
q1 = data['MDVP:Jitter(Abs)'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Jitter(Abs)'].loc[data['MDVP:Jitter(Abs)']>outliers_above].count())
print(data['MDVP:Jitter(Abs)'].loc[data['MDVP:Jitter(Abs)']<outliers_below].count())
mean_val = data['MDVP:Jitter(Abs)'].loc[data['MDVP:Jitter(Abs)']<=outliers_above].mean()
data['MDVP:Jitter(Abs)'] = data['MDVP:Jitter(Abs)'].mask(data['MDVP:Jitter(Abs)']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Jitter(Abs)'], showfliers=True).set_title("Distribution of 'MDVP:Jitter(Abs)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Jitter(Abs)'],color='g').set_title("MDVP:Jitter(Abs) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Jitter(Abs)'].plot.hist(color='r').set_title("MDVP:Jitter(Abs) Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:RAP'], showfliers=True).set_title("Distribution of 'MDVP:RAP'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:RAP'],color='g').set_title("MDVP:RAP Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:RAP'].plot.hist(color='r').set_title("MDVP:RAP Vs Frequency");
q3 = data['MDVP:RAP'].quantile(0.75)
q1 = data['MDVP:RAP'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:RAP'].loc[data['MDVP:RAP']>outliers_above].count())
print(data['MDVP:RAP'].loc[data['MDVP:RAP']<outliers_below].count())
mean_val = data['MDVP:RAP'].loc[data['MDVP:RAP']<=outliers_above].mean()
data['MDVP:RAP'] = data['MDVP:RAP'].mask(data['MDVP:RAP']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:RAP'], showfliers=True).set_title("Distribution of 'MDVP:RAP'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:RAP'],color='g').set_title("MDVP:RAP Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:RAP'].plot.hist(color='r').set_title("MDVP:RAP Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:PPQ'], showfliers=True).set_title("Distribution of 'MDVP:PPQ'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:PPQ'],color='g').set_title("MDVP:PPQ Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:PPQ'].plot.hist(color='r').set_title("MDVP:PPQ Vs Frequency");
q3 = data['MDVP:PPQ'].quantile(0.75)
q1 = data['MDVP:PPQ'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:PPQ'].loc[data['MDVP:PPQ']>outliers_above].count())
print(data['MDVP:PPQ'].loc[data['MDVP:PPQ']<outliers_below].count())
max_val = data['MDVP:PPQ'].loc[data['MDVP:PPQ']<=outliers_above].max()
data['MDVP:PPQ'] = data['MDVP:PPQ'].mask(data['MDVP:PPQ']>outliers_above,max_val)
# distribution after outlier treatment
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:PPQ'], showfliers=True).set_title("Distribution of 'MDVP:PPQ'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:PPQ'],color='g').set_title("MDVP:PPQ Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:PPQ'].plot.hist(color='r').set_title("MDVP:PPQ Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Jitter:DDP'], showfliers=True).set_title("Distribution of 'Jitter:DDP'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Jitter:DDP'],color='g').set_title("Jitter:DDP Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Jitter:DDP'].plot.hist(color='r').set_title("Jitter:DDP Vs Frequency");
q3 = data['Jitter:DDP'].quantile(0.75)
q1 = data['Jitter:DDP'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['Jitter:DDP'].loc[data['Jitter:DDP']>outliers_above].count())
print(data['Jitter:DDP'].loc[data['Jitter:DDP']<outliers_below].count())
max_val = data['Jitter:DDP'].loc[data['Jitter:DDP']<=outliers_above].max()
data['Jitter:DDP'] = data['Jitter:DDP'].mask(data['Jitter:DDP']>outliers_above,max_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Jitter:DDP'], showfliers=True).set_title("Distribution of 'Jitter:DDP'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Jitter:DDP'],color='g').set_title("Jitter:DDP Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Jitter:DDP'].plot.hist(color='r').set_title("Jitter:DDP Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Shimmer'], showfliers=True).set_title("Distribution of 'MDVP:Shimmer'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Shimmer'],color='g').set_title("MDVP:Shimmer Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Shimmer'].plot.hist(color='r').set_title("MDVP:Shimmer Vs Frequency");
q3 = data['MDVP:Shimmer'].quantile(0.75)
q1 = data['MDVP:Shimmer'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Shimmer'].loc[data['MDVP:Shimmer']>outliers_above].count())
print(data['MDVP:Shimmer'].loc[data['MDVP:Shimmer']<outliers_below].count())
mean_val = data['MDVP:Shimmer'].loc[data['MDVP:Shimmer']<=outliers_above].mean()
data['MDVP:Shimmer'] = data['MDVP:Shimmer'].mask(data['MDVP:Shimmer']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Shimmer'], showfliers=True).set_title("Distribution of 'MDVP:Shimmer'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Shimmer'],color='g').set_title("MDVP:Shimmer Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Shimmer'].plot.hist(color='r').set_title("MDVP:Shimmer Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Shimmer(dB)'], showfliers=True).set_title("Distribution of 'MDVP:Shimmer(dB)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Shimmer(dB)'],color='g').set_title("MDVP:Shimmer(dB) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Shimmer(dB)'].plot.hist(color='r').set_title("MDVP:Shimmer(dB) Vs Frequency");
q3 = data['MDVP:Shimmer(dB)'].quantile(0.75)
q1 = data['MDVP:Shimmer(dB)'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:Shimmer(dB)'].loc[data['MDVP:Shimmer(dB)']>outliers_above].count())
print(data['MDVP:Shimmer(dB)'].loc[data['MDVP:Shimmer(dB)']<outliers_below].count())
mean_val = data['MDVP:Shimmer(dB)'].loc[data['MDVP:Shimmer(dB)']<=outliers_above].mean()
data['MDVP:Shimmer(dB)'] = data['MDVP:Shimmer(dB)'].mask(data['MDVP:Shimmer(dB)']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:Shimmer(dB)'], showfliers=True).set_title("Distribution of 'MDVP:Shimmer(dB)'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:Shimmer(dB)'],color='g').set_title("MDVP:Shimmer(dB) Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:Shimmer(dB)'].plot.hist(color='r').set_title("MDVP:Shimmer(dB) Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:APQ3'], showfliers=True).set_title("Distribution of 'Shimmer:APQ3'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:APQ3'],color='g').set_title("Shimmer:APQ3 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:APQ3'].plot.hist(color='r').set_title("Shimmer:APQ3 Vs Frequency");
q3 = data['Shimmer:APQ3'].quantile(0.75)
q1 = data['Shimmer:APQ3'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['Shimmer:APQ3'].loc[data['Shimmer:APQ3']>outliers_above].count())
print(data['Shimmer:APQ3'].loc[data['Shimmer:APQ3']<outliers_below].count())
mean_val = data['Shimmer:APQ3'].loc[data['Shimmer:APQ3']<=outliers_above].mean()
data['Shimmer:APQ3'] = data['Shimmer:APQ3'].mask(data['Shimmer:APQ3']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:APQ3'], showfliers=True).set_title("Distribution of 'Shimmer:APQ3'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:APQ3'],color='g').set_title("Shimmer:APQ3 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:APQ3'].plot.hist(color='r').set_title("Shimmer:APQ3 Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:APQ5'], showfliers=True).set_title("Distribution of 'Shimmer:APQ5'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:APQ5'],color='g').set_title("Shimmer:APQ5 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:APQ5'].plot.hist(color='r').set_title("Shimmer:APQ5 Vs Frequency");
q3 = data['Shimmer:APQ5'].quantile(0.75)
q1 = data['Shimmer:APQ5'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['Shimmer:APQ5'].loc[data['Shimmer:APQ5']>outliers_above].count())
print(data['Shimmer:APQ5'].loc[data['Shimmer:APQ5']<outliers_below].count())
mean_val = data['Shimmer:APQ5'].loc[data['Shimmer:APQ5']<=outliers_above].mean()
data['Shimmer:APQ5'] = data['Shimmer:APQ5'].mask(data['Shimmer:APQ5']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:APQ5'], showfliers=True).set_title("Distribution of 'Shimmer:APQ5'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:APQ5'],color='g').set_title("Shimmer:APQ5 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:APQ5'].plot.hist(color='r').set_title("Shimmer:APQ5 Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:APQ'], showfliers=True).set_title("Distribution of 'MDVP:APQ'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:APQ'],color='g').set_title("MDVP:APQ Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:APQ'].plot.hist(color='r').set_title("MDVP:APQ Vs Frequency");
q3 = data['MDVP:APQ'].quantile(0.75)
q1 = data['MDVP:APQ'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['MDVP:APQ'].loc[data['MDVP:APQ']>outliers_above].count())
print(data['MDVP:APQ'].loc[data['MDVP:APQ']<outliers_below].count())
mean_val = data['MDVP:APQ'].loc[data['MDVP:APQ']<=outliers_above].mean()
data['MDVP:APQ'] = data['MDVP:APQ'].mask(data['MDVP:APQ']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['MDVP:APQ'], showfliers=True).set_title("Distribution of 'MDVP:APQ'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['MDVP:APQ'],color='g').set_title("MDVP:APQ Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['MDVP:APQ'].plot.hist(color='r').set_title("MDVP:APQ Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:DDA'], showfliers=True).set_title("Distribution of 'Shimmer:DDA'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:DDA'],color='g').set_title("Shimmer:DDA Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:DDA'].plot.hist(color='r').set_title("Shimmer:DDA Vs Frequency");
q3 = data['Shimmer:DDA'].quantile(0.75)
q1 = data['Shimmer:DDA'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['Shimmer:DDA'].loc[data['Shimmer:DDA']>outliers_above].count())
print(data['Shimmer:DDA'].loc[data['Shimmer:DDA']<outliers_below].count())
mean_val = data['Shimmer:DDA'].loc[data['Shimmer:DDA']<=outliers_above].mean()
data['Shimmer:DDA'] = data['Shimmer:DDA'].mask(data['Shimmer:DDA']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['Shimmer:DDA'], showfliers=True).set_title("Distribution of 'Shimmer:DDA'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['Shimmer:DDA'],color='g').set_title("Shimmer:DDA Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['Shimmer:DDA'].plot.hist(color='r').set_title("Shimmer:DDA Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['NHR'], showfliers=True).set_title("Distribution of 'NHR'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['NHR'],color='g').set_title("NHR Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['NHR'].plot.hist(color='r').set_title("NHR Vs Frequency");
q3 = data['NHR'].quantile(0.75)
q1 = data['NHR'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['NHR'].loc[data['NHR']>outliers_above].count())
print(data['NHR'].loc[data['NHR']<outliers_below].count())
mean_val = data['NHR'].loc[data['NHR']<=outliers_above].mean()
data['NHR'] = data['NHR'].mask(data['NHR']>outliers_above,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['NHR'], showfliers=True).set_title("Distribution of 'NHR'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['NHR'],color='g').set_title("NHR Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['NHR'].plot.hist(color='r').set_title("NHR Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['HNR'], showfliers=True).set_title("Distribution of 'HNR'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['HNR'],color='g').set_title("HNR Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['HNR'].plot.hist(color='r').set_title("HNR Vs Frequency");
q3 = data['HNR'].quantile(0.75)
q1 = data['HNR'].quantile(0.25)
t = q3-q1
outliers_above = q3+t
outliers_below = q1-t
print("outliers_above : {}".format(outliers_above))
print("outliers_below : {}".format(outliers_below))
print(data['HNR'].loc[data['HNR']>outliers_above].count())
print(data['HNR'].loc[data['HNR']<outliers_below].count())
mean_val = data['HNR'].loc[data['HNR']>outliers_below].mean()
data['HNR'] = data['HNR'].mask(data['HNR']<outliers_below,mean_val)
# distribution after outlier correction
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['HNR'], showfliers=True).set_title("Distribution of 'HNR'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['HNR'],color='g').set_title("HNR Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['HNR'].plot.hist(color='r').set_title("HNR Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['RPDE'], showfliers=True).set_title("Distribution of 'RPDE'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['RPDE'],color='g').set_title("RPDE Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['RPDE'].plot.hist(color='r').set_title("RPDE Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['D2'], showfliers=True).set_title("Distribution of 'D2'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['D2'],color='g').set_title("D2 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['D2'].plot.hist(color='r').set_title("D2 Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['DFA'], showfliers=True).set_title("Distribution of 'DFA'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['DFA'],color='g').set_title("DFA Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['DFA'].plot.hist(color='r').set_title("DFA Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['spread1'], showfliers=True).set_title("Distribution of 'spread1'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['spread1'],color='g').set_title("spread1 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['spread1'].plot.hist(color='r').set_title("spread1 Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['spread2'], showfliers=True).set_title("Distribution of 'spread2'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['spread2'],color='g').set_title("spread2 Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['spread2'].plot.hist(color='r').set_title("spread2 Vs Frequency");
plt.figure(figsize= (20,20))
#boxplot
plt.subplot(3,2,1)
sns.boxplot(data['PPE'], showfliers=True).set_title("Distribution of 'PPE'")
#distplot
plt.subplot(3,2,2)
sns.distplot(data['PPE'],color='g').set_title("PPE Vs Frequency")
#histogram plot
plt.subplot(3,2,3)
data['PPE'].plot.hist(color='r').set_title("PPE Vs Frequency");
sns.pairplot(data.iloc[:,1:],hue='status')
plt.figure(figsize=(25, 25))
ax = sns.heatmap(data.corr(), vmax=.8, square=True, fmt='.2f', annot=True, linecolor='white', linewidths=0.01)
plt.title('Correlation of Attributes')
plt.show()
The target column 'status' has a comparatively high correlation with spread1, spread2, and PPE, and is positively associated with all the other attributes except MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), and HNR, with which it is negatively associated.
MDVP:Fo(Hz) is negatively associated with most of the attributes but positively associated with the minimum and maximum vocal fundamental frequencies.
All the measures of variation in fundamental frequency and variation in amplitude have strong positive associations with each other, positive associations with the nonlinear dynamical complexity measures and the nonlinear measures of fundamental frequency variation, and a high negative association with NHR.
NHR has a negative association with almost all attributes except the maximum vocal fundamental frequency.
The two nonlinear dynamical complexity measures (RPDE, D2) have a very slight negative association with each other and are positively associated with most of the variables, except the vocal fundamental frequencies and NHR, with which they are negatively associated.
The three nonlinear measures of fundamental frequency variation have positive associations with each other and with almost all the attributes, except the three vocal fundamental frequencies, with which they are negatively associated.
DFA has a slight positive association with most of the attributes and a comparatively high correlation with the average and maximum vocal fundamental frequencies.
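To back these observations with numbers, a small hedged sketch listing each attribute's correlation with the target ('name' is excluded as non-numeric):
# Sketch: correlation of every numeric attribute with 'status'
corr_with_status = data.drop(columns=['name']).corr()['status'].drop('status')
print(corr_with_status.sort_values(ascending=False))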
status_counts = pd.DataFrame(data["status"].value_counts()).reset_index()
status_counts.columns =["Labels","status"]
status_counts
From the counts above we can see that there are more than twice as many Parkinson's cases as healthy cases.
ax=sns.countplot(y="status", data=data)
plt.title('Distribution of status ')
plt.xlabel('Count')
total = len(data["status"])
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width()/total)
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y))
plt.show()
From the graph above, 75.4% of the cases have Parkinson's disease and the rest are healthy.
for i in data.columns:
    if i != 'status' and i != 'name':
        sns.catplot(x='status', y=i, kind='box', data=data)
Healthy persons have higher vocal fundamental frequencies and HNR than persons with the disease; all the other attributes are comparatively higher for diseased persons than for healthy persons.
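A quick numeric check of this claim, as a hedged sketch ('name' is excluded as non-numeric): compare the mean of each attribute per class.
# Sketch: mean of each attribute per class (0 = healthy, 1 = Parkinson's)
class_means = data.drop(columns=['name']).groupby('status').mean()
print(class_means.T)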
plt.figure(figsize=(20,20))
plt.subplot(3,2,1)
sns.stripplot(data['status'], data['spread1'])
plt.subplot(3,2,2)
sns.stripplot(data['status'], data['spread2'])
plt.subplot(3,2,3)
sns.stripplot(data['status'], data['PPE'])
plt.figure(figsize=(10,5))
sns.distplot( data[data['status'] == 0]['spread1'], color = 'g',label='status=0')
sns.distplot( data[data['status'] == 1]['spread1'], color = 'b',label='status=1')
plt.legend()
spread1 values are higher for persons who have the disease.
plt.figure(figsize=(10,5))
sns.distplot( data[data['status'] == 0]['spread2'], color = 'g',label='status=0')
sns.distplot( data[data['status'] == 1]['spread2'], color = 'b',label='status=1')
plt.legend()
spread2 values are slightly higher for persons who have the disease.
plt.figure(figsize=(10,5))
sns.distplot( data[data['status'] == 0]['PPE'], color = 'g',label='status=0')
sns.distplot( data[data['status'] == 1]['PPE'], color = 'b',label='status=1')
plt.legend()
PPE values are slightly higher for persons who have the disease.
The column 'name' is not relevant for model building, so we drop it.
data = data.drop(['name'], axis=1)
data.head()
X=data.drop('status', axis=1)
y=data[['status']]
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70/30 train-test split
columns = X_train.columns
print('x train data {}'.format(X_train.shape))
print('y train data {}'.format(y_train.shape))
print('x test data {}'.format(X_test.shape))
print('y test data {}'.format(y_test.shape))
from sklearn.preprocessing import StandardScaler
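# Fit the scaler on the training split only, then apply the same transform
# to the test split so no test-set statistics leak into training.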
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
model = LogisticRegression()
solvers = ['liblinear']
penalty = ['l1', 'l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
lg = LogisticRegression(solver="liblinear", C=0.1, penalty='l1')
lg.fit(X_train, y_train)
y_predicted=lg.predict(X_test)
model_score=lg.score(X_test, y_test)
log_accuracy = accuracy_score(y_test,y_predicted)
recall_score=metrics.recall_score(y_test, y_predicted, average='binary')
precision_score=metrics.precision_score(y_test, y_predicted, average='binary')
f1_score=metrics.f1_score(y_test, y_predicted, average='binary')
print('Logistic Regression Model Accuracy Score : {}'.format(log_accuracy))
print('Model Recall Score : {}'.format(recall_score))
print('Model Precision Score : {}'.format(precision_score))
print('Model F1 Score : {}'.format(f1_score))
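Since roughly three quarters of the records are Parkinson's cases, accuracy alone can be flattering; a hedged sketch adding ROC-AUC, which is insensitive to the class ratio:
# Sketch: ROC-AUC as an imbalance-robust complement to accuracy
lg_probs = lg.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (Parkinson's)
print('Logistic Regression ROC-AUC : {:.3f}'.format(metrics.roc_auc_score(y_test.values.ravel(), lg_probs)))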
resultsDf = pd.DataFrame({'Method':['Log Regression'], 'accuracy': log_accuracy})
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
print('\nLogistic Regression classification Report : \n',metrics.classification_report(y_test,y_predicted))
#confusion matrix
lg_cm=metrics.confusion_matrix(y_test,y_predicted, labels=[1, 0])
df_cm = pd.DataFrame(lg_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='g')
model = KNeighborsClassifier()
n_neighbors = range(1, 21, 2)
weights = ['distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# creating odd list of K for KNN
myList = list(range(1,50))
# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))
# empty list that will hold accuracy scores
ac_scores = []
# compute test accuracy for each odd k from 1 to 49
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # predict the response
    y_pred = knn.predict(X_test)
    # evaluate accuracy
    scores = accuracy_score(y_test, y_pred)
    ac_scores.append(scores)
# changing to misclassification error
MSE = [1 - x for x in ac_scores]
# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
# Use the optimal k found above, with the best weights/metric from the grid search
knn = KNeighborsClassifier(n_neighbors=optimal_k, weights='distance', metric='manhattan')
# fitting the model
knn.fit(X_train, y_train)
# predict the response
y_pred = knn.predict(X_test)
# evaluate accuracy
knn_accuracy_score=accuracy_score(y_test, y_pred)
print('KNN_Model Accuracy Score : {}'.format(knn_accuracy_score))
tempResultsDf = pd.DataFrame({'Method':['KNN'], 'accuracy': [knn_accuracy_score]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
print('\nK-NN classification Report : \n',metrics.classification_report(y_test, y_pred))
#confusion matrix
knn_cm=metrics.confusion_matrix(y_test,y_pred, labels=[1, 0])
df_knn_cm = pd.DataFrame(knn_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_knn_cm, annot=True,fmt='g')
nb_model = GaussianNB()
nb_model.fit(X_train,y_train)
nb_y_pred = nb_model.predict(X_test)
nb_score = nb_model.score(X_test, y_test)
nb_accuracy = accuracy_score(y_test, nb_y_pred)
print('NB_Model Accuracy Score : {}'.format(nb_accuracy))
tempResultsDf = pd.DataFrame({'Method':['Naive Bayes'], 'accuracy': [nb_accuracy]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
print('\nNaive Bayes classification Report : \n',metrics.classification_report(y_test, nb_y_pred))
#confusion matrix
nb_cm=metrics.confusion_matrix(y_test, nb_y_pred, labels=[1, 0])
df_nb_cm = pd.DataFrame(nb_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_nb_cm, annot=True,fmt='g')
model = DecisionTreeClassifier()
parameters = {'max_depth':[1,2,3,4,5], 'min_samples_leaf':[1,2,3,4,5],'min_samples_split':[2,3,4,5],'criterion':['gini','entropy']}
# define grid search
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=parameters, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# invoking the decision tree classifier function
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=1,max_depth=4,min_samples_leaf=3,min_samples_split=2)
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
from IPython.display import Image
#import pydotplus as pydot
from sklearn import tree
from os import system
Parkinson_File = open('parkinson.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Parkinson_File, feature_names = list(columns))
Parkinson_File.close()
system("dot -Tpng parkinson.dot -o parkinson.png")
Image("parkinson.png")
print('Feature Importance for Decision Tree Classifier')
feature_importances = pd.DataFrame(dTree.feature_importances_, index = X.columns,
columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2));
print('\nDecision Tree classification Report : \n',metrics.classification_report(y_test, dTree.predict(X_test)))
#confusion matrix
Dtree_accuracy=dTree.score(X_test , y_test)
print('Decision Tree Accuracy Score : {}'.format(Dtree_accuracy))
dt_y_predict = dTree.predict(X_test)
dt_cm=metrics.confusion_matrix(y_test, dt_y_predict, labels=[1, 0])
df_cm = pd.DataFrame(dt_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['Decision Tree'], 'accuracy': [Dtree_accuracy]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
# taking KNN, Naive Bayes and Logistic Regression as base models and Decision Tree as the meta-classifier
from mlxtend.classifier import StackingClassifier
models=[knn,nb_model,lg]
sclf = StackingClassifier(classifiers=models,
meta_classifier=dTree)
sclf.fit(X_train,y_train)
y_preds=sclf.predict(X_test)
stack_accuracy=accuracy_score(y_test,y_preds)
stack_accuracy
st_cm=metrics.confusion_matrix(y_test, y_preds, labels=[1, 0])
df_cm = pd.DataFrame(st_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['Stacking Classifier'], 'accuracy': [stack_accuracy]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
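mlxtend's StackingClassifier fits the meta-classifier on the base models' training-set predictions; scikit-learn (0.22+) ships a built-in variant that uses out-of-fold predictions instead. A hedged equivalent sketch:
from sklearn.ensemble import StackingClassifier as SkStackingClassifier
# Sketch: the same stack with scikit-learn's built-in estimator
# (uses out-of-fold predictions for the meta-classifier by default)
sk_stack = SkStackingClassifier(estimators=[('knn', knn), ('nb', nb_model), ('lr', lg)],
                                final_estimator=dTree)
sk_stack.fit(X_train, y_train.values.ravel())
print('sklearn StackingClassifier accuracy :', sk_stack.score(X_test, y_test))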
model = RandomForestClassifier()
n_estimators = [10, 100, 200,300]
max_features = ['sqrt', 'log2']
# define grid search
grid = dict(n_estimators=n_estimators,max_features=max_features)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
rfcl = RandomForestClassifier(n_estimators = 300, max_features = 'sqrt')
rfcl = rfcl.fit(X_train, y_train)
rf_y_predict = rfcl.predict(X_test)
rf_acc=rfcl.score(X_test, y_test)
print('Random Forest Accuracy Score : {}'.format(rf_acc))
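As with the decision tree earlier, the forest's impurity-based importances can be inspected; a short hedged sketch (column names are taken from X, since the model was fit on the scaled array):
# Sketch: feature importances averaged over the forest's trees
rf_importances = pd.Series(rfcl.feature_importances_, index=X.columns).sort_values()
rf_importances.plot(kind='barh', figsize=(15, 7.2), title='Random Forest Feature Importance');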
rf_cm=metrics.confusion_matrix(y_test, rf_y_predict, labels=[1, 0])
df_cm = pd.DataFrame(rf_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['Random Forest'], 'accuracy': [rf_acc]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
model = BaggingClassifier()
n_estimators = [10, 100, 200, 300]
# define grid search
grid = dict(n_estimators=n_estimators)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=300,random_state=1)
bgcl = bgcl.fit(X_train, y_train)
from sklearn.metrics import confusion_matrix
bg_y_predict = bgcl.predict(X_test)
bg_acc=bgcl.score(X_test , y_test)
print('Bagging Model Accuracy Score : {}'.format(bg_acc))
bg_cm=metrics.confusion_matrix(y_test, bg_y_predict, labels=[1, 0])
df_cm = pd.DataFrame(bg_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': [bg_acc]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
model = AdaBoostClassifier()
n_estimators = [10, 50, 100, 200, 300]
# define grid search
grid = dict(n_estimators=n_estimators)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
abcl = AdaBoostClassifier(n_estimators=300, random_state=1)
abcl = abcl.fit(X_train, y_train)
ab_y_predict = abcl.predict(X_test)
ab_acc=abcl.score(X_test , y_test)
print('AdaBoostingModel Accuracy Score : {}'.format(ab_acc))
ab_cm=metrics.confusion_matrix(y_test, ab_y_predict, labels=[1, 0])
df_cm = pd.DataFrame(ab_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['AdaBoosting'], 'accuracy': [ab_acc]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
model = GradientBoostingClassifier()
n_estimators = [10, 50, 100, 200, 300]
# define grid search
grid = dict(n_estimators=n_estimators)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
gbcl = GradientBoostingClassifier(n_estimators = 300,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
gb_y_predict = gbcl.predict(X_test)
gb_acc=gbcl.score(X_test, y_test)
print('Gradient Boost Model Accuracy Score : {}'.format(gb_acc))
gb_cm=metrics.confusion_matrix(y_test, gb_y_predict, labels=[1, 0])
df_cm = pd.DataFrame(gb_cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Method':['Gradient Boost'], 'accuracy': [gb_acc]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
models = []
models.append(('Logistic Regression',lg))
models.append(('KNeighbour',knn))
models.append(('Naive Bayes',nb_model))
models.append(('Decision Tree',dTree))
models.append(('Random Forest Classifier', rfcl))
models.append(('Bagging Classifier', bgcl))
models.append(('Ada Boosting Classifier',abcl))
models.append(('Gradient Boosting Classifier',gbcl))
from sklearn import model_selection
# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for model_name, model in models:
    # shuffle=True is required when a random_state is supplied to KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=12000)
    scores = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(scores)
    names.append(model_name)
    msg = "Model Name: %s, Mean of Accuracy : %f, Std of Accuracy: %f" % (model_name, scores.mean(), scores.std())
    print(msg)
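A box plot of the per-fold scores makes the spread across models easier to compare; a small hedged sketch using the results collected above:
# Sketch: distribution of 10-fold CV accuracy for each model
plt.figure(figsize=(12, 6))
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names, rotation=45, ha='right')
plt.ylabel('CV Accuracy')
plt.title('Cross-Validation Accuracy by Model')
plt.show()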
resultsDf
ax=sns.barplot(y="Method", x=("accuracy"),data=resultsDf)
total = len(resultsDf["accuracy"])
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width())
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y))
plt.show()
Confusion Matrix
# With labels=[1, 0]: row order is actual [1, 0], column order is predicted [1, 0],
# so cm[0][1] counts false negatives and cm[1][0] counts false positives.
print('\nLogistic Regression: \n', lg_cm)
print('\nTrue Positive  = ', lg_cm[0][0])
print('True Negative  = ', lg_cm[1][1])
print('False Negative = ', lg_cm[0][1])
print('False Positive = ', lg_cm[1][0])
print('\nK-Nearest Neighbors: \n', knn_cm)
print('\nTrue Positive  = ', knn_cm[0][0])
print('True Negative  = ', knn_cm[1][1])
print('False Negative = ', knn_cm[0][1])
print('False Positive = ', knn_cm[1][0])
print('\nNaive Bayes: \n', nb_cm)
print('\nTrue Positive  = ', nb_cm[0][0])
print('True Negative  = ', nb_cm[1][1])
print('False Negative = ', nb_cm[0][1])
print('False Positive = ', nb_cm[1][0])
print('\nDecision Tree: \n', dt_cm)
print('\nTrue Positive  = ', dt_cm[0][0])
print('True Negative  = ', dt_cm[1][1])
print('False Negative = ', dt_cm[0][1])
print('False Positive = ', dt_cm[1][0])
print('\nStacking: \n', st_cm)
print('\nTrue Positive  = ', st_cm[0][0])
print('True Negative  = ', st_cm[1][1])
print('False Negative = ', st_cm[0][1])
print('False Positive = ', st_cm[1][0])
print('\nRandom Forest: \n', rf_cm)
print('\nTrue Positive  = ', rf_cm[0][0])
print('True Negative  = ', rf_cm[1][1])
print('False Negative = ', rf_cm[0][1])
print('False Positive = ', rf_cm[1][0])
print('\nBagging: \n', bg_cm)
print('\nTrue Positive  = ', bg_cm[0][0])
print('True Negative  = ', bg_cm[1][1])
print('False Negative = ', bg_cm[0][1])
print('False Positive = ', bg_cm[1][0])
print('\nAdaboost: \n', ab_cm)
print('\nTrue Positive  = ', ab_cm[0][0])
print('True Negative  = ', ab_cm[1][1])
print('False Negative = ', ab_cm[0][1])
print('False Positive = ', ab_cm[1][0])
print('\nGradient Boosting: \n', gb_cm)
print('\nTrue Positive  = ', gb_cm[0][0])
print('True Negative  = ', gb_cm[1][1])
print('False Negative = ', gb_cm[0][1])
print('False Positive = ', gb_cm[1][0])
Type I and Type II errors are lowest for the Decision Tree.
Gradient Boosting also has a high accuracy of 89.8%.
All the ensemble techniques achieve higher accuracy than the base models.
At least 300 trees were required in every ensemble model, which contributed to the higher accuracy.
Computation time was relatively low for the ensemble models.
Scaling of the data also plays an important role.
Hyperparameter tuning of the ensemble models takes a lot of computation time, since hundreds of trees must be fit for every candidate setting.
K-fold cross-validation gives a better picture of each model, since every record is used for training across the folds.